Alignment-free Sequence Analysis Using Extensible Markov Models
Authors
Abstract
Profile models based on Hidden Markov Models (HMMs) for sequence studies have gained visibility among researchers. While the mathematical foundation and proven algorithms such as the Viterbi, Forward, and Backward algorithms provide a rigorous probabilistic platform, the reliance on classical alignment results in extremely high time complexity. We propose the use of another kind of Markov model, the Extensible Markov Model (EMM), to create profile architectures that are more efficient in space and time complexity than their HMM counterparts. EMM efficiency comes from an alignment-free paradigm that uses an improved statistical signature of sequences. The EMM approach is based on sliding p-mers: every possible p-mer pattern is counted along equal-sized segments of a sequence, and the resulting count vectors are clustered into Markov states. These count vectors shift the position-based, letter-by-letter sequence analysis problem for phylogenetic trees, classification, and search to a more efficient numerical vector space. Using adapted Karlin-Altschul statistics from the Basic Local Alignment Search Tool (BLAST) literature, the EMM-based sequence classification also computes a p-value for statistical significance. We present a comparison between profiles generated using profile HMMs and EMMs.
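The count-vector step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the DNA alphabet, the non-overlapping segmentation, the dropping of a trailing partial segment, the Manhattan distance, and the greedy threshold clustering in `assign_states` are all assumptions made for illustration, and every function name here is hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA sequences


def pmer_count_vector(segment, p):
    """Count every overlapping p-mer in `segment`.

    Returns counts in lexicographic p-mer order (AA, AC, AG, ... for p=2),
    so the vector always has len(ALPHABET) ** p entries.
    """
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    return [counts["".join(pm)] for pm in product(ALPHABET, repeat=p)]


def segment_vectors(sequence, segment_len, p):
    """Split `sequence` into equal-sized, non-overlapping segments and
    return one p-mer count vector per segment.

    Assumption: a trailing segment shorter than `segment_len` is dropped.
    """
    n_full = len(sequence) // segment_len
    return [
        pmer_count_vector(sequence[k * segment_len:(k + 1) * segment_len], p)
        for k in range(n_full)
    ]


def assign_states(vectors, threshold):
    """Greedy threshold clustering (an illustrative stand-in for EMM clustering).

    Each vector joins the nearest existing state (Manhattan distance) if it
    lies within `threshold`; otherwise it founds a new state. Returns the
    sequence of state ids visited, from which transition counts could be built.
    """
    centers = []
    path = []
    for v in vectors:
        best, best_d = None, None
        for idx, c in enumerate(centers):
            d = sum(abs(a - b) for a, b in zip(v, c))
            if best_d is None or d < best_d:
                best, best_d = idx, d
        if best is None or best_d > threshold:
            centers.append(v)
            best = len(centers) - 1
        path.append(best)
    return path
```

For example, `segment_vectors("AAAACCCC", 4, 2)` yields two 16-entry vectors, one per segment, and `assign_states` then maps them to state ids without any alignment step. No position-by-position comparison of letters is needed once the sequence is in vector form.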
Related resources
Analyzing taxonomic classification using extensible Markov models
MOTIVATION: As next generation sequencing is rapidly adding new genomes, their correct placement in the taxonomy needs verification. However, the current methods for confirming classification of a taxon or suggesting revision for a potential misplacement rely on computationally intense multi-sequence alignment followed by an iterative adjustment of the distance matrix. Due to intra-heterogenei...
A generalization of Profile Hidden Markov Model (PHMM) using one-by-one dependency between sequences
The Profile Hidden Markov Model (PHMM) can be poor at capturing dependency between observations because of the statistical assumptions it makes. To overcome this limitation, the dependency between residues in a multiple sequence alignment (MSA) which is the representative of a PHMM can be combined with the PHMM. Based on the fact that sequences appearing in the final MSA are written based on th...
xREI: a phylo-grammar visualization webserver
Phylo-grammars, probabilistic models combining Markov chain substitution models with stochastic grammars, are powerful models for annotating structured features in multiple sequence alignments and analyzing the evolution of those features. In the past, these methods have been cumbersome to implement and modify. xrate provides means for the rapid development of phylo-grammars (using a simple fil...
An Analysis of Continuous Time Markov Chains using Generator Matrices
This paper mainly analyzes the applications of the Generator matrices in a Continuous Time Markov Chain (CTMC). Hidden Markov models [HMMs] together with related probabilistic models such as Stochastic Context-Free Grammars [SCFGs] are the basis of many algorithms for the analysis of biological sequences. Combined with the continuous-time Markov chain theory of likelihood based phylogeny, stoch...
Using evolutionary Expectation Maximization to estimate indel rates
MOTIVATION: The Expectation Maximization (EM) algorithm, in the form of the Baum-Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the ...
Publication date: 2010